Abstract

A new method is proposed for exploiting causal independencies in exact Bayesian network in- 
ference. A Bayesian network can be viewed as representing a factorization of a joint probability 
into the multiplication of a set of conditional probabilities. We present a notion of causal indepen- 
dence that enables one to further factorize the conditional probabilities into a combination of even 
smaller factors and consequently obtain a finer-grain factorization of the joint probability. The new 
formulation of causal independence lets us specify the conditional probability of a variable given its 
parents in terms of an associative and commutative operator, such as "or", "sum" or "max", on the 
contribution of each parent. We start with a simple algorithm VE for Bayesian network inference 
that, given evidence and a query variable, uses the factorization to find the posterior distribution of 
the query. We show how this algorithm can be extended to exploit causal independence. Empirical 
studies, based on the CPCS networks for medical diagnosis, show that this method is more efficient 
than previous methods and allows for inference in larger networks than previous algorithms. 



1. Introduction 

Reasoning with uncertain knowledge and beliefs has long been recognized as an important research 
issue in AI (Shortliffe & Buchanan, 1975; Duda et al., 1976). Several methodologies have been 
proposed, including certainty factors, fuzzy sets, Dempster-Shafer theory, and probability theory. 
The probabilistic approach is now by far the most popular among all those alternatives, mainly due 
to a knowledge representation framework called Bayesian networks or belief networks (Pearl, 1988; 
Howard & Matheson, 1981). 

Bayesian networks are a graphical representation of (in)dependencies amongst random variables. 
A Bayesian network (BN) is a DAG with nodes representing random variables, and arcs representing 
direct influence. The independence that is encoded in a Bayesian network is that each variable is
independent of its non-descendants given its parents.

Bayesian networks aid in knowledge acquisition by specifying which probabilities are needed. 
Where the network structure is sparse, the number of probabilities required can be much less than the 
number required if there were no independencies. The structure can be exploited computationally to 
make inference faster (Pearl, 1988; Lauritzen & Spiegelhalter, 1988; Jensen et al., 1990; Shafer & 
Shenoy, 1990). 

The definition of a Bayesian network does not constrain how a variable depends on its parents.
Often, however, there is much structure in these probability functions that can be exploited for
knowledge acquisition and inference. One such case is where some dependencies depend on particular
values of other variables; such dependencies can be stated as rules (Poole, 1993), trees (Boutilier
et al., 1996) or as multinets (Geiger & Heckerman, 1996). Another is where the function can be
described using a binary operator that can be applied to values from each of the parent variables. It
is the latter, known as 'causal independencies', that we seek to exploit in this paper.

Causal independence refers to the situation where multiple causes contribute independently to 
a common effect. A well-known example is the noisy OR-gate model (Good, 1961). Knowledge 
engineers have been using specific causal independence models in simplifying knowledge acquisi- 
tion (Henrion, 1987; Olesen et al., 1989; Olesen & Andreassen, 1993). Heckerman (1993) was the 
first to formalize the general concept of causal independence. The formalization was later refined by 
Heckerman and Breese (1994). 

Kim and Pearl (1983) showed how the use of noisy OR-gates can speed up inference in a special
kind of BNs known as polytrees; D'Ambrosio (1994, 1995) showed the same for two-level BNs with
binary variables. For general BNs, Olesen et al. (1989) and Heckerman (1993) proposed two ways
of using causal independencies to transform the network structures. Inference in the transformed
networks is more efficient than in the original networks (see Section 9).

This paper proposes a new method for exploiting a special type of causal independence (see Sec- 
tion 4) that still covers common causal independence models such as noisy OR-gates, noisy MAX- 
gates, noisy AND-gates, and noisy adders as special cases. The method is based on the following 
observation. A BN can be viewed as representing a factorization of a joint probability into the
multiplication of a list of conditional probabilities (Shachter et al., 1990; Zhang & Poole, 1994; Li &
D'Ambrosio, 1994). The type of causal independence studied in this paper leads to further
factorization of the conditional probabilities (Section 5). A finer-grain factorization of the joint probability
is obtained as a result. We propose to extend exact inference algorithms that only exploit conditional 
independencies to also make use of the finer-grain factorization provided by causal independence. 

The state-of-the-art exact inference algorithm is called clique tree propagation (CTP) (Lauritzen &
Spiegelhalter, 1988; Jensen et al., 1990; Shafer & Shenoy, 1990). This paper proposes another
algorithm called variable elimination (VE) (Section 3), that is related to SPI (Shachter et al., 1990; Li
& D'Ambrosio, 1994), and extends it to make use of the finer-grain factorization (see Sections 6, 7,
and 8). Rather than compiling to a secondary structure and finding the posterior probability for each 
variable, VE is query-oriented; it needs only that part of the network relevant to the query given the 
observations, and only does the work necessary to answer that query. We chose VE instead of CTP 
because of its simplicity and because it can carry out inference in large networks that CTP cannot 
deal with. 

Experiments (Section 10) have been performed with two CPCS networks provided by Pradhan.
The networks consist of 364 and 421 nodes respectively and they both contain abundant causal
independencies. Before this paper, the best one could do in terms of exact inference would be to first
transform the networks by using Jensen et al.'s or Heckerman's technique and then apply CTP. In
our experiments, the computer ran out of memory when constructing clique trees for the transformed
networks. When this occurs one cannot answer any query at all. However, the extended VE
algorithm has been able to answer almost all randomly generated queries with twenty or fewer
observations (findings) in both networks.

One might propose to first perform Jensen et al.'s or Heckerman's transformation and then apply
VE. Our experiments show that this is significantly less efficient than the extended VE algorithm. 

We begin with a brief review of the concept of a Bayesian network and the issue of inference.

2. Bayesian Networks

We assume that a problem domain is characterized by a set of random variables. Beliefs are repre- 
sented by a Bayesian network (BN) — an annotated directed acyclic graph, where nodes represent 
the random variables, and arcs represent probabilistic dependencies amongst the variables. We use 
the terms 'node' and 'variable' interchangeably. Associated with each node is a conditional proba- 
bility of the variable given its parents. 

In addition to the explicitly represented conditional probabilities, a BN also implicitly represents
conditional independence assertions. Let $x_1, x_2, \ldots, x_n$ be an enumeration of all the nodes in a BN
such that each node appears after its parents, and let $\pi_{x_i}$ be the set of parents of a node $x_i$. The
Bayesian network represents the following independence assertion:

Each variable $x_i$ is conditionally independent of the variables in $\{x_1, x_2, \ldots, x_{i-1}\}$ given
values for its parents.

The conditional independence assertions and the conditional probabilities together entail a joint
probability over all the variables. By the chain rule, we have:

$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, x_2, \ldots, x_{i-1}) = \prod_{i=1}^{n} P(x_i \mid \pi_{x_i}), \qquad (1)$$

where the second equation is true because of the conditional independence assertions. The
conditional probabilities $P(x_i \mid \pi_{x_i})$ are given in the specification of the BN. Consequently, one can, in
theory, do arbitrary probabilistic reasoning in a BN.

2.1 Inference 

Inference refers to the process of computing the posterior probability $P(X \mid Y = Y_0)$ of a set $X$ of
query variables after obtaining some observations $Y = Y_0$. Here $Y$ is a list of observed variables and
$Y_0$ is the corresponding list of observed values. Often, $X$ consists of only one query variable.

In theory, $P(X \mid Y = Y_0)$ can be obtained from the marginal probability $P(X, Y)$, which in turn
can be computed from the joint probability $P(x_1, x_2, \ldots, x_n)$ by summing out variables outside
$X \cup Y$ one by one. In practice, this is not viable because summing out a variable from a joint
probability requires an exponential number of additions.

The key to more efficient inference lies in the concept of factorization. A factorization of a joint 
probability is a list of factors (functions) from which one can construct the joint probability. 

A factor is a function from a set of variables into a number. We say that the factor contains a
variable if the factor is a function of that variable; or say it is a factor of the variables on which it depends.
Suppose $f_1$ and $f_2$ are factors, where $f_1$ is a factor that contains variables $x_1, \ldots, x_i, y_1, \ldots, y_j$
(we write this as $f_1(x_1, \ldots, x_i, y_1, \ldots, y_j)$) and $f_2$ is a factor with variables $y_1, \ldots, y_j, z_1, \ldots, z_k$,
where $y_1, \ldots, y_j$ are the variables in common to $f_1$ and $f_2$. The product of $f_1$ and $f_2$ is a factor that
is a function of the union of the variables, namely $x_1, \ldots, x_i, y_1, \ldots, y_j, z_1, \ldots, z_k$, defined by:

$$(f_1 \times f_2)(x_1, \ldots, x_i, y_1, \ldots, y_j, z_1, \ldots, z_k) = f_1(x_1, \ldots, x_i, y_1, \ldots, y_j) \times f_2(y_1, \ldots, y_j, z_1, \ldots, z_k)$$

Figure 1: A Bayesian network.

Let $f(x_1, \ldots, x_i)$ be a function of variables $x_1, \ldots, x_i$. Setting, say, $x_1$ in $f(x_1, \ldots, x_i)$ to a particular
value $\alpha$ yields $f(x_1 = \alpha, x_2, \ldots, x_i)$, which is a function of variables $x_2, \ldots, x_i$.

If $f(x_1, \ldots, x_i)$ is a factor, we can sum out a variable, say $x_1$, resulting in a factor of variables
$x_2, \ldots, x_i$, defined by

$$\Big(\sum_{x_1} f\Big)(x_2, \ldots, x_i) = f(x_1 = \alpha_1, x_2, \ldots, x_i) + \cdots + f(x_1 = \alpha_m, x_2, \ldots, x_i)$$

where $\alpha_1, \ldots, \alpha_m$ are the possible values of variable $x_1$.

Because of equation (1), a BN can be viewed as representing a factorization of a joint probability.
For example, the Bayesian network in Figure 1 factorizes the joint probability $P(a, b, c, e_1, e_2, e_3)$
into the following list of factors:

$$P(a),\ P(b),\ P(c),\ P(e_1 \mid a, b, c),\ P(e_2 \mid a, b, c),\ P(e_3 \mid e_1, e_2).$$

Multiplying those factors yields the joint probability. 

Suppose a joint probability $P(z_1, z_2, \ldots, z_m)$ is factorized into the multiplication of a list of
factors $f_1, f_2, \ldots, f_m$. While obtaining $P(z_2, \ldots, z_m)$ by summing out $z_1$ from $P(z_1, z_2, \ldots, z_m)$
requires an exponential number of additions, obtaining a factorization of $P(z_2, \ldots, z_m)$ can often be
done with much less computation. Consider the following procedure:

Procedure sum-out($\mathcal{F}$, $z$):

• Inputs: $\mathcal{F}$: a list of factors; $z$: a variable.

• Output: A list of factors.

1. Remove from $\mathcal{F}$ all the factors, say $f_1, \ldots, f_k$, that contain $z$,

2. Add the new factor $\sum_z \prod_{i=1}^{k} f_i$ to $\mathcal{F}$ and return $\mathcal{F}$.

Theorem 1 Suppose a joint probability $P(z_1, z_2, \ldots, z_m)$ is factorized into the multiplication of a
list $\mathcal{F}$ of factors. Then sum-out($\mathcal{F}$, $z_1$) returns a list of factors whose multiplication is $P(z_2, \ldots, z_m)$.

Proof: Suppose $\mathcal{F}$ consists of factors $f_1, f_2, \ldots, f_m$ and suppose $z_1$ appears in and only in factors
$f_1, f_2, \ldots, f_k$. Then

$$P(z_2, \ldots, z_m) = \sum_{z_1} P(z_1, z_2, \ldots, z_m) = \sum_{z_1} \prod_{i=1}^{m} f_i = \Big[\sum_{z_1} \prod_{i=1}^{k} f_i\Big] \Big[\prod_{i=k+1}^{m} f_i\Big].$$

The theorem follows. □

Only variables that appear in the factors $f_1, f_2, \ldots, f_k$ participate in the computation of sum-out($\mathcal{F}$, $z_1$),
and those are often only a small portion of all the variables. This is why inference in a BN can be 
tractable in many cases, even if the general problem is NP-hard (Cooper, 1990). 
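As an illustration, the factor product, the summing-out operation, and procedure sum-out can be sketched in a few lines of Python. This is a minimal sketch, not the implementation used in the experiments: the `Factor` class, the dictionary-based table layout, and the function names `multiply`, `marginalize` and `sum_out` are all assumptions made for the example.

```python
from itertools import product

class Factor:
    """A factor: a table mapping each assignment of `variables` to a number.
    (Hypothetical representation chosen for this sketch.)"""
    def __init__(self, variables, domains, table):
        self.variables = tuple(variables)   # ordered variable names
        self.domains = dict(domains)        # variable name -> tuple of values
        self.table = dict(table)            # assignment tuple -> number

def multiply(f1, f2):
    """Product of two factors: a function of the union of their variables."""
    variables = f1.variables + tuple(v for v in f2.variables if v not in f1.variables)
    domains = {**f1.domains, **f2.domains}
    table = {}
    for values in product(*(domains[v] for v in variables)):
        row = dict(zip(variables, values))
        a = f1.table[tuple(row[v] for v in f1.variables)]
        b = f2.table[tuple(row[v] for v in f2.variables)]
        table[values] = a * b
    return Factor(variables, {v: domains[v] for v in variables}, table)

def marginalize(f, z):
    """Sum variable z out of a single factor f."""
    keep = tuple(v for v in f.variables if v != z)
    table = {}
    for values, p in f.table.items():
        row = dict(zip(f.variables, values))
        key = tuple(row[v] for v in keep)
        table[key] = table.get(key, 0.0) + p
    return Factor(keep, {v: f.domains[v] for v in keep}, table)

def sum_out(factors, z):
    """Procedure sum-out: replace the factors that contain z by the
    factor obtained by multiplying them and summing z out."""
    touching = [f for f in factors if z in f.variables]
    rest = [f for f in factors if z not in f.variables]
    if not touching:          # z does not occur: nothing to do in this sketch
        return rest
    prod = touching[0]
    for f in touching[1:]:
        prod = multiply(prod, f)
    return rest + [marginalize(prod, z)]

# Two-node network a -> b with made-up numbers: P(a) and P(b | a).
binary = (0, 1)
p_a = Factor(("a",), {"a": binary}, {(0,): 0.7, (1,): 0.3})
p_b_given_a = Factor(("b", "a"), {"b": binary, "a": binary},
                     {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.9})

result = sum_out([p_a, p_b_given_a], "a")
print(result[0].variables, result[0].table)   # P(b); P(b=1) is approximately 0.41
```

Only the factors that mention the eliminated variable are touched, which is the point made by Theorem 1: the remaining factors are carried along unchanged.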

3. The Variable Elimination Algorithm 

Based on the discussions of the previous section, we present a simple algorithm for computing $P(X \mid Y = Y_0)$.
The algorithm is based on the intuitions underlying D'Ambrosio's symbolic probabilistic inference
(SPI) (Shachter et al., 1990; Li & D'Ambrosio, 1994), and first appeared in Zhang and Poole (1994).
It is essentially Dechter's (1996) bucket elimination algorithm for belief assessment.

The algorithm is called variable elimination (VE) because it sums out variables from a list of
factors one by one. An ordering $\rho$ by which the variables outside $X \cup Y$ are to be summed out is
required as an input. It is called an elimination ordering.

Procedure VE($\mathcal{F}$, $X$, $Y$, $Y_0$, $\rho$)

• Inputs: $\mathcal{F}$: the list of conditional probabilities in a BN;

$X$: a list of query variables;

$Y$: a list of observed variables;

$Y_0$: the corresponding list of observed values;

$\rho$: an elimination ordering for variables outside $X \cup Y$.

• Output: $P(X \mid Y = Y_0)$.

1. Set the observed variables in all factors to their corresponding observed values.

2. While $\rho$ is not empty,

(a) Remove the first variable $z$ from $\rho$,

(b) Call sum-out($\mathcal{F}$, $z$). Endwhile

3. Set $h$ = the multiplication of all the factors in $\mathcal{F}$.
/* $h$ is a function of variables in $X$. */

4. Return $h(X) / \sum_X h(X)$. /* Renormalization */

Theorem 2 The output of VE($\mathcal{F}$, $X$, $Y$, $Y_0$, $\rho$) is indeed $P(X \mid Y = Y_0)$.

Proof: Consider the following modifications to the procedure. First remove step 1. Then the factor
$h$ produced at step 3 will be a function of variables in $X$ and $Y$. Add a new step after step 3 that sets
the observed variables in $h$ to their observed values.

Let $f(y, A)$ be a function of variable $y$ and of variables in $A$. We use $f(y, A)|_{y=\alpha}$ to denote
$f(y = \alpha, A)$. Let $f(y, -)$, $g(y, -)$, and $h(y, z, -)$ be three functions of $y$ and other variables. It is
evident that

$$[f(y, -)\, g(y, -)]|_{y=\alpha} = f(y, -)|_{y=\alpha}\; g(y, -)|_{y=\alpha},$$

$$\Big[\sum_z h(y, z, -)\Big]\Big|_{y=\alpha} = \sum_z \big[h(y, z, -)|_{y=\alpha}\big].$$

Consequently, the modifications do not change the output of the procedure.

According to Theorem 1, after the modifications the factor produced at step 3 is simply the marginal
probability $P(X, Y)$. Consequently, the output is exactly $P(X \mid Y = Y_0)$. □

The complexity of VE can be measured by the number of numerical multiplications and numerical
summations it performs. An optimal elimination ordering is one that results in the least complexity.
The problem of finding an optimal elimination ordering is NP-complete (Arnborg et al., 1987).
Commonly used heuristics include minimum deficiency search (Bertele & Brioschi, 1972) and
maximum cardinality search (Tarjan & Yannakakis, 1984). Kjærulff (1990) has empirically shown that
minimum deficiency search is the best existing heuristic. We use minimum deficiency search in our 
experiments because we also found it to be better than the maximum cardinality search. 
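The four steps of procedure VE can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in our experiments: a factor is represented as a pair `(variables, table)`, the domain dictionary `dom` and the helper names `multiply`, `marginalize`, `restrict` and `ve` are hypothetical, and the elimination ordering is supplied by the caller rather than computed by a heuristic.

```python
from itertools import product

def multiply(f, g, dom):
    """Product of two factors over the union of their variables."""
    fv, ft = f
    gv, gt = g
    vs = fv + tuple(v for v in gv if v not in fv)
    table = {}
    for vals in product(*(dom[v] for v in vs)):
        row = dict(zip(vs, vals))
        table[vals] = ft[tuple(row[v] for v in fv)] * gt[tuple(row[v] for v in gv)]
    return vs, table

def marginalize(f, z):
    """Sum variable z out of factor f."""
    fv, ft = f
    keep = tuple(v for v in fv if v != z)
    table = {}
    for vals, p in ft.items():
        key = tuple(x for v, x in zip(fv, vals) if v != z)
        table[key] = table.get(key, 0.0) + p
    return keep, table

def restrict(f, var, value):
    """Set an observed variable to its observed value, dropping it from f."""
    fv, ft = f
    if var not in fv:
        return f
    i = fv.index(var)
    keep = fv[:i] + fv[i + 1:]
    return keep, {vals[:i] + vals[i + 1:]: p for vals, p in ft.items() if vals[i] == value}

def ve(factors, query, evidence, ordering, dom):
    # Step 1: set observed variables to their observed values.
    for var, value in evidence.items():
        factors = [restrict(f, var, value) for f in factors]
    # Step 2: sum out the remaining non-query variables in the given order.
    for z in ordering:
        touching = [f for f in factors if z in f[0]]
        if not touching:
            continue
        factors = [f for f in factors if z not in f[0]]
        prod = touching[0]
        for f in touching[1:]:
            prod = multiply(prod, f, dom)
        factors.append(marginalize(prod, z))
    # Step 3: multiply the remaining factors; h is a function of the query variables.
    h = factors[0]
    for f in factors[1:]:
        h = multiply(h, f, dom)
    assert set(h[0]) == set(query)
    # Step 4: renormalize.
    total = sum(h[1].values())
    return h[0], {vals: p / total for vals, p in h[1].items()}

# Chain a -> b -> c with made-up numbers; query P(a | c = 1), eliminating b.
dom = {"a": (0, 1), "b": (0, 1), "c": (0, 1)}
p_a = (("a",), {(0,): 0.7, (1,): 0.3})
p_b_a = (("b", "a"), {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.9})
p_c_b = (("c", "b"), {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8})

variables, posterior = ve([p_a, p_b_a, p_c_b], ["a"], {"c": 1}, ["b"], dom)
print(variables, posterior)
```

Because the observation on `c` is applied first, the factor for `c` shrinks to a function of `b` alone before any multiplication takes place, which is why observations make VE cheaper rather than more expensive.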

3.1 VE versus Clique Tree Propagation 

Clique tree propagation (Lauritzen & Spiegelhalter, 1988; Jensen et al., 1990; Shafer & Shenoy, 
1990) has a compilation step that transforms a BN into a secondary structure called clique tree or 
junction tree. The secondary structure allows CTP to compute the answers to all queries with one 
query variable and a fixed set of observations in twice the time needed to answer one such query in 
the clique tree. For many applications this is a desirable property since a user might want to compare 
the posterior probabilities of different variables. 

CTP takes work to build the secondary structure before any observations have been received. 
When the Bayesian network is reused, the cost of building the secondary structure can be amortized 
over many cases. Each observation entails a propagation through the network.

Given all of the observations, VE processes one query at a time. If a user wants the posterior 
probabilities of several variables, or for a sequence of observations, she needs to run VE for each of 
the variables and observation sets. 

The cost, in terms of the number of summations and multiplications, of answering a single query 
with no observations using VE is of the same order of magnitude as using CTP. A particular clique 
tree and propagation sequence encodes an elimination ordering; using VE on that elimination order- 
ing results in approximately the same summations and multiplications of factors as in the CTP (there 
is some discrepancy, as VE does not actually form the marginals on the cliques, but works with condi- 
tional probabilities directly). Observations make VE simpler (the observed variables are eliminated 
at the start of the algorithm), but each observation in CTP requires propagation of evidence. Because 
VE is query oriented, we can prune nodes that are irrelevant to specific queries (Geiger et al., 1990; 
Lauritzen et al., 1990; Baker & Boult, 1990). In CTP, on the other hand, the clique tree structure is 
kept static at run time, and hence does not allow pruning of irrelevant nodes. 

CTP encodes a particular space-time tradeoff, and VE another. CTP is particularly suited to the 
case where observations arrive incrementally, where we want the posterior probability of each node,






Exploiting Causal Independence in Bayesian Network Inference 



and where the cost of building the clique tree can be amortized over many cases. VE is suited for 
one-off queries, where there is a single query variable and all of the observations are given at once. 

Unfortunately, there are large real-world networks that CTP cannot deal with due to time and 
space complexities (see Section 10 for two examples). In such networks, VE can still answer some 
of the possible queries because it permits pruning of irrelevant variables. 

4. Causal Independence 

Bayesian networks place no restriction on how a node depends on its parents. Unfortunately this 
means that in the most general case we need to specify an exponential (in the number of parents) 
number of conditional probabilities for each node. There are many cases where there is structure in 
the probability tables that can be exploited for both acquisition and for inference. One such case that 
we investigate in this paper is known as 'causal independence'.

In one interpretation, arcs in a BN represent causal relationships; the parents $c_1, c_2, \ldots, c_m$ of a
variable $e$ are viewed as causes that jointly bear on the effect $e$. Causal independence refers to the
situation where the causes $c_1, c_2, \ldots, c_m$ contribute independently to the effect $e$.

More precisely, $c_1, c_2, \ldots, c_m$ are said to be causally independent w.r.t. effect $e$ if there exist
random variables $\xi_1, \xi_2, \ldots, \xi_m$ that have the same frame, i.e., the same set of possible values, as $e$
such that

1. For each $i$, $\xi_i$ probabilistically depends on $c_i$ and is conditionally independent of all other $c_j$'s
and all other $\xi_j$'s given $c_i$, and

2. There exists a commutative and associative binary operator $*$ over the frame of $e$ such that
$e = \xi_1 * \xi_2 * \cdots * \xi_m$.

Using the independence notion of Pearl (1988), let $I(X, Y \mid Z)$ mean that $X$ is independent of $Y$
given $Z$. The first condition is:

$$I(\xi_1, \{c_2, \ldots, c_m, \xi_2, \ldots, \xi_m\} \mid c_1)$$

and similarly for the other variables. This entails $I(\xi_1, c_j \mid c_1)$ and $I(\xi_1, \xi_j \mid c_1)$ for each $c_j$ and $\xi_j$
where $j \neq 1$.

We refer to $\xi_i$ as the contribution of $c_i$ to $e$. In less technical terms, causes are causally
independent w.r.t. their common effect if individual contributions from different causes are independent and
the total influence on the effect is a combination of the individual contributions.

We call the variable e a convergent variable as it is where independent contributions from differ- 
ent sources are collected and combined (and for the lack of a better name). Non-convergent variables 
will simply be called regular variables. We call * the base combination operator of e. 

The definition of causal independence given here is slightly different from that given by
Heckerman and Breese (1994) and Srinivas (1993). However, it still covers common causal independence
models such as noisy OR-gates (Good, 1961; Pearl, 1988), noisy MAX-gates (Diez, 1993), noisy 
AND-gates, and noisy adders (Dagum & Galper, 1993) as special cases. One can see this in the fol- 
lowing examples. 

Example 1 (Lottery) Buying lottery tickets affects your wealth. The amounts of money you spend on
buying different kinds of lottery tickets affect your wealth independently. In other words, they are causally
independent w.r.t. the change in your wealth. Let $c_1, \ldots, c_k$ denote the amounts of money you spend
on buying $k$ types of lottery tickets. Let $\xi_1, \ldots, \xi_k$ be the changes in your wealth due to buying the
different types of lottery tickets respectively. Then, each $\xi_i$ depends probabilistically on $c_i$ and is
conditionally independent of the other $c_j$ and $\xi_j$ given $c_i$. Let $e$ be the total change in your wealth
due to lottery buying. Then $e = \xi_1 + \cdots + \xi_k$. Hence $c_1, \ldots, c_k$ are causally independent w.r.t. $e$. The
base combination operator of $e$ is numerical addition. This example is an instance of a causal
independence model called noisy adders.

If $c_1, \ldots, c_k$ are the amounts of money you spend on buying lottery tickets in the same lottery,
then $c_1, \ldots, c_k$ are not causally independent w.r.t. $e$, because winning with one ticket reduces the
chance of winning with the other. Thus, $\xi_1$ is not conditionally independent of $\xi_2$ given $c_1$. However,
if the $c_i$ represent the expected change in wealth in buying tickets in the same lottery, then they would
be causally independent, but not probabilistically independent (there would be arcs between the $c_i$'s).

Example 2 (Alarm) Consider the following scenario. There are $m$ different motion sensors each
of which is connected to a burglary alarm. If one sensor activates, then the alarm rings. Different
sensors could have different reliability. We can treat the activation of sensor $i$ as a random variable.
The reliability of the sensor can be reflected in the $\xi_i$. We assume that the sensors fail independently.^1
Assume that the alarm can only be caused by a sensor activation.^2 Then $alarm = \xi_1 \vee \cdots \vee \xi_m$; the
base combination operator here is the logical OR operator. This example is an instance of a causal
independence model called the noisy OR-gate.

The following example is not an instance of any causal independence models that we know: 

Example 3 (Contract renewal) Faculty members at a university are evaluated in teaching, research,
and service for the purpose of contract renewal. A faculty member's contract is not renewed,
renewed without pay raise, renewed with a pay raise, or renewed with double pay raise depending on
whether his performance is evaluated unacceptable in at least one of the three areas, acceptable in
all areas, excellent in one area, or excellent in at least two areas.

Let $c_1$, $c_2$, and $c_3$ be the fractions of time a faculty member spends on teaching, research, and
service respectively. Let $\xi_i$ represent the evaluation he gets in the $i$th area. It can take values 0, 1,
and 2 depending on whether the evaluation is unacceptable, acceptable, or excellent. The variable
$\xi_i$ depends probabilistically on $c_i$. It is reasonable to assume that $\xi_i$ is conditionally independent of
other $c_j$'s and other $\xi_j$'s given $c_i$.

Let $e$ represent the contract renewal result. The variable can take values 0, 1, 2, and 3 depending
on whether the contract is not renewed, renewed with no pay raise, renewed with a pay raise, or
renewed with double pay raise. Then $e = \xi_1 * \xi_2 * \xi_3$, where the base combination operator $*$ is given
in the following table:








      *  |  0   1   2   3
    -----+----------------
      0  |  0   0   0   0
      1  |  0   1   2   3
      2  |  0   2   3   3
      3  |  0   3   3   3



1. This is called the exception independence assumption by Pearl (1988).

2. This is called the accountability assumption by Pearl (1988). The assumption can always be satisfied by introducing
a node that represents all other causes (Henrion, 1987).

So, the fractions of time a faculty member spends in the three areas are causally independent
w.r.t. the contract renewal result. 
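For an operator to serve as a base combination operator it must be commutative and associative. The short check below verifies this for the operator of Example 3 by exhaustive enumeration. The table entries are our reading of the verbal description of the example (0 is absorbing, since one unacceptable evaluation prevents renewal); the names `TABLE` and `combine` are introduced only for this sketch.

```python
from itertools import product

# Base combination operator of Example 3 over the frame {0, 1, 2, 3}.
# Row and column indices are the two operands; entries are the result.
TABLE = [
    [0, 0, 0, 0],
    [0, 1, 2, 3],
    [0, 2, 3, 3],
    [0, 3, 3, 3],
]

def combine(a, b):
    """Apply the base combination operator to two values of the frame."""
    return TABLE[a][b]

frame = range(4)

# Exhaustively check the two properties required by the definition.
commutative = all(combine(a, b) == combine(b, a)
                  for a, b in product(frame, repeat=2))
associative = all(combine(combine(a, b), c) == combine(a, combine(b, c))
                  for a, b, c in product(frame, repeat=3))
print(commutative, associative)

# Evaluations (acceptable, excellent, excellent): excellent in two areas.
two_excellent = combine(combine(1, 2), 2)
# Evaluations (excellent, unacceptable, excellent): not renewed.
one_unacceptable = combine(combine(2, 0), 2)
print(two_excellent, one_unacceptable)
```

Since the operator is associative, the three evaluations can be combined in any grouping, which is what later allows contributions to be combined pairwise during inference.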

In the traditional formulation of a Bayesian network we need to specify, for each variable, a number
of conditional probabilities that is exponential in the number of its parents. With causal independence,
the number of conditional probabilities $P(\xi_i \mid c_i)$ is linear in $m$. This is why causal independence
can reduce complexity of knowledge acquisition (Henrion, 1987; Pearl, 1988; Olesen et al., 1989; 
Olesen & Andreassen, 1993). In the following sections we show how causal independence can also 
be exploited for computational gain. 

4.1 Conditional Probabilities of Convergent Variables 

VE allows us to exploit structure in a Bayesian network by providing a factorization of the joint
probability distribution. In this section we show how causal independence can be used to factorize the
joint distribution even further. The initial factors in the VE algorithm are of the form $P(e \mid c_1, \ldots, c_m)$.
We want to break this down into simpler factors so that we do not need a table exponential in $m$. The
following proposition shows how causal independence can be used to do this:

Proposition 1 Let $e$ be a node in a BN and let $c_1, c_2, \ldots, c_m$ be the parents of $e$. If $c_1, c_2, \ldots, c_m$
are causally independent w.r.t. $e$, then the conditional probability $P(e \mid c_1, \ldots, c_m)$ can be obtained
from the conditional probabilities $P(\xi_i \mid c_i)$ through

$$P(e = \alpha \mid c_1, \ldots, c_m) = \sum_{\alpha_1 * \cdots * \alpha_m = \alpha} P(\xi_1 = \alpha_1 \mid c_1) \cdots P(\xi_m = \alpha_m \mid c_m), \qquad (2)$$

for each value $\alpha$ of $e$. Here $*$ is the base combination operator of $e$.

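Proof:^3 The definition of causal independence entails the independence assertions

$$I(\xi_1, \{c_2, \ldots, c_m\} \mid c_1) \quad \text{and} \quad I(\xi_1, \xi_2 \mid c_1).$$

By the axiom of weak union (Pearl, 1988, p. 84), we have $I(\xi_1, \xi_2 \mid \{c_1, \ldots, c_m\})$. Thus all of the
$\xi_i$ are mutually independent given $\{c_1, \ldots, c_m\}$.

Also we have, by the definition of causal independence, $I(\xi_1, \{c_2, \ldots, c_m\} \mid c_1)$, so

$$P(\xi_1 \mid \{c_1, c_2, \ldots, c_m\}) = P(\xi_1 \mid c_1).$$

Thus we have:

$$P(e = \alpha \mid c_1, \ldots, c_m)$$

$$= P(\xi_1 * \cdots * \xi_m = \alpha \mid c_1, \ldots, c_m)$$

$$= \sum_{\alpha_1 * \cdots * \alpha_m = \alpha} P(\xi_1 = \alpha_1, \ldots, \xi_m = \alpha_m \mid c_1, \ldots, c_m)$$

$$= \sum_{\alpha_1 * \cdots * \alpha_m = \alpha} P(\xi_1 = \alpha_1 \mid c_1, \ldots, c_m)\, P(\xi_2 = \alpha_2 \mid c_1, \ldots, c_m) \cdots P(\xi_m = \alpha_m \mid c_1, \ldots, c_m)$$

$$= \sum_{\alpha_1 * \cdots * \alpha_m = \alpha} P(\xi_1 = \alpha_1 \mid c_1)\, P(\xi_2 = \alpha_2 \mid c_2) \cdots P(\xi_m = \alpha_m \mid c_m).$$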



□ 
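Equation (2) can be evaluated directly by enumerating the values of the contributions. The sketch below does this for one fixed assignment of the causes; the function name `convergent_cpt`, its argument layout, and the numbers in the noisy OR-gate example are assumptions made for illustration. Note that this brute-force enumeration is exponential in the number of causes and is meant only to make the formula concrete.

```python
from itertools import product

def convergent_cpt(contribs, op, frame):
    """Evaluate equation (2) for one fixed assignment of the causes.

    contribs[i][alpha] is P(xi_i = alpha | c_i) for the fixed value of c_i,
    op is the base combination operator, and frame is the frame of e.
    Returns a dict alpha -> P(e = alpha | c_1, ..., c_m).
    """
    result = {alpha: 0.0 for alpha in frame}
    for alphas in product(frame, repeat=len(contribs)):
        # Combine the contribution values alpha_1 * ... * alpha_m.
        combined = alphas[0]
        for a in alphas[1:]:
            combined = op(combined, a)
        # Multiply the individual contribution probabilities.
        p = 1.0
        for dist, a in zip(contribs, alphas):
            p *= dist[a]
        result[combined] += p
    return result

# Noisy OR-gate with three causes that are all present; each cause activates
# its contribution with a (made-up) probability of 0.8, 0.5 and 0.9.
contribs = [
    {0: 0.2, 1: 0.8},
    {0: 0.5, 1: 0.5},
    {0: 0.1, 1: 0.9},
]
cpt = convergent_cpt(contribs, lambda a, b: a | b, (0, 1))
print(cpt)   # P(e=0) is 0.2 * 0.5 * 0.1, i.e. approximately 0.01
```

Only the three contribution distributions are specified, yet they determine the full conditional distribution of the effect, which is the saving in knowledge acquisition noted above.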

The next four sections develop an algorithm for exploiting causal independence in inference. 



3. Thanks to an anonymous reviewer for helping us to simplify this proof.

5. Causal Independence and Heterogeneous Factorizations

In this section, we shall first introduce an operator for combining factors that contain convergent 
variables. The operator is a basic ingredient of the algorithm to be developed in the next three sec- 
tions. Using the operator, we shall rewrite equation (2) in a form that is more convenient to use in 
inference and introduce the concept of heterogeneous factorization. 

Consider two factors $f$ and $g$. Let $e_1, \ldots, e_k$ be the convergent variables that appear in both $f$ and
$g$, let $A$ be the list of regular variables that appear in both $f$ and $g$, let $B$ be the list of variables that
appear only in $f$, and let $C$ be the list of variables that appear only in $g$. Both $B$ and $C$ can contain
convergent variables, as well as regular variables. Suppose $*_i$ is the base combination operator of
$e_i$. Then, the combination $f \otimes g$ of $f$ and $g$ is a function of variables $e_1, \ldots, e_k$ and of the variables
in $A$, $B$, and $C$. It is defined by:^4

$$f \otimes g(e_1 = \alpha_1, \ldots, e_k = \alpha_k, A, B, C) = \sum_{\alpha_{11} *_1 \alpha_{12} = \alpha_1} \cdots \sum_{\alpha_{k1} *_k \alpha_{k2} = \alpha_k} f(e_1 = \alpha_{11}, \ldots, e_k = \alpha_{k1}, A, B)\; g(e_1 = \alpha_{12}, \ldots, e_k = \alpha_{k2}, A, C), \qquad (3)$$

for each value $\alpha_i$ of $e_i$. We shall sometimes write $f \otimes g$ as $f(e_1, \ldots, e_k, A, B) \otimes g(e_1, \ldots, e_k, A, C)$
to make explicit the arguments of $f$ and $g$.

Note that base combination operators of different convergent variables can be different. 

The following proposition exhibits some of the basic properties of the combination operator $\otimes$.

Proposition 2 1. If $f$ and $g$ do not share any convergent variables, then $f \otimes g$ is simply the
multiplication of $f$ and $g$. 2. The operator $\otimes$ is commutative and associative.

Proof: The first item is obvious. The commutativity of $\otimes$ follows readily from the commutativity of
multiplication and the base combination operators. We shall prove the associativity of $\otimes$ in a special
case. The general case can be proved by following the same line of reasoning.

Suppose $f$, $g$, and $h$ are three factors that contain only one variable $e$ and the variable is
convergent. We need to show that $(f \otimes g) \otimes h = f \otimes (g \otimes h)$. Let $*$ be the base combination operator of $e$. By
the associativity of $*$, we have, for any value $\alpha$ of $e$, that

$$(f \otimes g) \otimes h(e = \alpha) = \sum_{\alpha_4 * \alpha_3 = \alpha} f \otimes g(e = \alpha_4)\, h(e = \alpha_3)$$

$$= \sum_{\alpha_4 * \alpha_3 = \alpha} \Big[\sum_{\alpha_1 * \alpha_2 = \alpha_4} f(e = \alpha_1)\, g(e = \alpha_2)\Big] h(e = \alpha_3)$$

$$= \sum_{\alpha_1 * \alpha_2 * \alpha_3 = \alpha} f(e = \alpha_1)\, g(e = \alpha_2)\, h(e = \alpha_3)$$

$$= \sum_{\alpha_1 * \alpha_4 = \alpha} f(e = \alpha_1) \Big[\sum_{\alpha_2 * \alpha_3 = \alpha_4} g(e = \alpha_2)\, h(e = \alpha_3)\Big]$$

$$= \sum_{\alpha_1 * \alpha_4 = \alpha} f(e = \alpha_1)\, g \otimes h(e = \alpha_4)$$

$$= f \otimes (g \otimes h)(e = \alpha).$$

The proposition is hence proved. □

4. Note that the base combination operators under the summations in equation (3) are indexed. With each convergent
variable is an associated operator, and we always use the binary operator that is associated with the corresponding
convergent variable. In the examples, for ease of exposition, we will use one base combination operator. Where there
is more than one type of base combination operator (e.g., we may use 'or', 'sum' and 'max' for different variables in
the same network), we have to keep track of which operators are associated with which convergent variables. This
will, however, complicate the description.

The following propositions give some properties for $\otimes$ that correspond to the operations that we
exploited for the algorithm VE. The proofs are straightforward and are omitted.

Proposition 3 Suppose $f$ and $g$ are factors and variable $z$ appears in $f$ and not in $g$, then

$$\sum_z (f g) = \Big(\sum_z f\Big) g, \quad \text{and} \quad \sum_z (f \otimes g) = \Big(\sum_z f\Big) \otimes g.$$

Proposition 4 Suppose $f$, $g$ and $h$ are factors such that $g$ and $h$ do not share any convergent
variables, then

$$g (f \otimes h) = (g f) \otimes h. \qquad (4)$$

5.1 Rewriting Equation 2 

Noticing that the contribution variable $\xi_i$ has the same possible values as $e$, we define functions
$f_i(e, c_i)$ by

$$f_i(e = \alpha, c_i) = P(\xi_i = \alpha \mid c_i),$$

for any value $\alpha$ of $e$. We shall refer to $f_i$ as the contributing factor of $c_i$ to $e$.
By using the operator $\otimes$, we can now rewrite equation (2) as follows

$$P(e \mid c_1, \ldots, c_m) = \otimes_{i=1}^{m} f_i(e, c_i). \qquad (5)$$

It is interesting to notice the similarity between equation (1) and equation (5). In equation (1)
conditional independence allows one to factorize a joint probability into factors that involve fewer
variables, while in equation (5) causal independence allows one to factorize a conditional probability
into factors that involve fewer variables. However, the ways by which the factors are combined are
different in the two equations. 
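To make the operator $\otimes$ concrete, the sketch below implements equation (3) for the special case in which the two factors share exactly one convergent variable, and uses it to assemble a noisy OR-gate conditional probability from two contributing factors as in equation (5). The pair-based factor representation, the domain dictionary `dom`, the function name `combine`, and the numbers are all assumptions made for this illustration.

```python
from itertools import product

def combine(f, g, e, op, dom):
    """f (x) g per equation (3), for two factors that share exactly one
    convergent variable e with base combination operator op.
    A factor is a pair (variables, table); shared regular variables are
    matched by value, as in ordinary factor multiplication."""
    fv, ft = f
    gv, gt = g
    vs = fv + tuple(v for v in gv if v not in fv)
    table = {vals: 0.0 for vals in product(*(dom[v] for v in vs))}
    others = tuple(v for v in vs if v != e)
    for rest in product(*(dom[v] for v in others)):
        row = dict(zip(others, rest))
        # Sum over all pairs (a1, a2) of values of e; each pair contributes
        # to the entry whose value of e is op(a1, a2).
        for a1, a2 in product(dom[e], repeat=2):
            pf = ft[tuple(a1 if v == e else row[v] for v in fv)]
            pg = gt[tuple(a2 if v == e else row[v] for v in gv)]
            out = dict(row)
            out[e] = op(a1, a2)
            table[tuple(out[v] for v in vs)] += pf * pg
    return vs, table

# Contributing factors f1(e, c1) and f2(e, c2) of a noisy OR-gate:
# an absent cause contributes 0 for sure; a present cause activates its
# contribution with (made-up) probability 0.8 and 0.6 respectively.
dom = {"e": (0, 1), "c1": (0, 1), "c2": (0, 1)}
f1 = (("e", "c1"), {(0, 0): 1.0, (1, 0): 0.0, (0, 1): 0.2, (1, 1): 0.8})
f2 = (("e", "c2"), {(0, 0): 1.0, (1, 0): 0.0, (0, 1): 0.4, (1, 1): 0.6})

cpt_vars, cpt = combine(f1, f2, "e", lambda a, b: a | b, dom)
print(cpt_vars)
print(cpt[(1, 1, 1)])   # P(e=1 | c1=1, c2=1), approximately 1 - 0.2 * 0.4
```

The two contributing factors each have four entries, whereas the combined table has eight; with $m$ causes the contributing factors grow linearly in $m$ while the combined table grows exponentially, which is the reason for keeping the factors separate for as long as possible during inference.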

5.2 Heterogeneous Factorizations

Consider the Bayesian network in Figure 1. It factorizes the joint probability $P(a, b, c, e_1, e_2, e_3)$
into the following list of factors:

$$P(a),\ P(b),\ P(c),\ P(e_1 \mid a, b, c),\ P(e_2 \mid a, b, c),\ P(e_3 \mid e_1, e_2).$$

We say that this factorization is homogeneous because all the factors are combined in the same way, 
i.e., by multiplication. 

Now suppose the e/s are convergent variables. Then their conditional probabilities can be fur- 
ther factorized as follows: 

P{ei\a,b,c) = /ii(ei, a)®/i2(ei, 5)®/i3(ei, c), 
P{e2\a,b,c) = /21 (62, a) 0/22(62, &)®/23(e2, c), 
-P(e3|ei,e2) = /31 (63, 61)0/32(63, 62), 
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where the factor /ii(ei, a) , for instance, is the contributing factor of a to ei. 
We say that the following list of factors

$f_{11}(e_1, a)$, $f_{12}(e_1, b)$, $f_{13}(e_1, c)$, $f_{21}(e_2, a)$, $f_{22}(e_2, b)$, $f_{23}(e_2, c)$, $f_{31}(e_3, e_1)$, $f_{32}(e_3, e_2)$,
P(a), P(b), and P(c) (6)

constitute a heterogeneous factorization of $P(a, b, c, e_1, e_2, e_3)$ because the joint probability can be
obtained by combining those factors in a proper order using either multiplication or the operator ⊗.
The word heterogeneous is to signify the fact that different factor pairs might be combined in different
ways. We call each $f_{ij}$ a heterogeneous factor because it needs to be combined with the other
$f_{ik}$'s by the operator ⊗ before it can be combined with other factors by multiplication. In contrast,
we call the factors P(a), P(b), and P(c) homogeneous factors.

We shall refer to that heterogeneous factorization as the heterogeneous factorization represented
by the BN in Figure 1. It is obvious that this heterogeneous factorization is of finer grain than the
homogeneous factorization represented by the BN.



6. Flexible Heterogeneous Factorizations and Deputation 

This paper extends VE to exploit this finer-grain factorization. We will compute the answer to a query 
by summing out variables one by one from the factorization just as we did in VE. 

The correctness of VE is guaranteed by the fact that factors in a homogeneous factorization can 
be combined (by multiplication) in any order and by the distributivity of multiplication over sum- 
mations (see the proof of Theorem 1). 

According to Proposition 3, the operator ⊗ is distributive over summations. However, factors in
a heterogeneous factorization cannot be combined in arbitrary order. For example, consider the
heterogeneous factorization (6). While it is correct to combine $f_{11}(e_1, a)$ and $f_{12}(e_1, b)$ using ⊗, and to
combine $f_{31}(e_3, e_1)$ and $f_{32}(e_3, e_2)$ using ⊗, it is not correct to combine $f_{11}(e_1, a)$ and $f_{31}(e_3, e_1)$
with ⊗. We want to combine these latter two by multiplication, but only after each has been combined
with its sibling heterogeneous factors.

To overcome this difficulty, a transformation called deputation will be performed on our BN. 
The transformation does not change the answers to queries. And the heterogeneous factorization 
represented by the transformed BN is flexible in the following sense: 

A heterogeneous factorization of a joint probability is flexible if:

The joint probability
= multiplication of all homogeneous factors
× combination (by ⊗) of all heterogeneous factors. (7)

This property allows us to carry out multiplication of homogeneous factors in arbitrary order,
and, since ⊗ is associative and commutative, combination of heterogeneous factors in arbitrary order.
If the conditions of Proposition 4 are satisfied, we can also exchange a multiplication with a
combination by ⊗. To guarantee the conditions of Proposition 4, the elimination ordering needs to
be constrained (Sections 7 and 8).

Figure 2: The BN in Figure 1 after the deputation of convergent variables.

The heterogeneous factorization of $P(a, b, c, e_1, e_2, e_3)$ given at the end of the previous section is
not flexible. Consider combining all the heterogeneous factors. Since the operator ⊗ is commutative
and associative, one can first combine, for each i, all the $f_{ik}$'s, obtaining the conditional probability
of $e_i$, and then combine the resulting conditional probabilities. The combination

$P(e_1 \mid a, b, c) \otimes P(e_2 \mid a, b, c) \otimes P(e_3 \mid e_1, e_2)$

is not the same as the multiplication

$P(e_1 \mid a, b, c)\, P(e_2 \mid a, b, c)\, P(e_3 \mid e_1, e_2)$

because the convergent variables $e_1$ and $e_2$ appear in more than one factor. Consequently, equation
(7) does not hold and the factorization is not flexible. This problem arises when a convergent variable
is shared between two factors that are not siblings. For example, we do not want to combine
$f_{11}(e_1, a)$ and $f_{31}(e_3, e_1)$ using ⊗. In order to tackle this problem we introduce a new 'deputation'
variable so that each heterogeneous factor contains a single convergent variable.
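
A toy calculation makes the difference concrete. The sketch below assumes a single binary convergent variable with "or" as its base combination operator; the two one-variable factors and the helper name `otimes_or` are hypothetical and chosen only for illustration.

```python
def otimes_or(f, g):
    """Combine two factors over one shared binary convergent variable.

    f[alpha] and g[alpha] are indexed by the value of that variable; the
    base combination operator is assumed to be logical OR.
    """
    out = [0.0, 0.0]
    for a1 in (0, 1):
        for a2 in (0, 1):
            out[a1 | a2] += f[a1] * g[a2]
    return out


f = [0.3, 0.7]  # hypothetical factor values
g = [0.6, 0.4]  # hypothetical factor values

combined = otimes_or(f, g)                 # combination by the operator
multiplied = [f[0] * g[0], f[1] * g[1]]    # ordinary pointwise product
```

The two results agree at value 0 but differ at value 1, which is why factors that share a convergent variable without being siblings must not be combined with ⊗.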

Deputation is a transformation that one can apply to a BN to make the heterogeneous factorization
represented by the BN flexible. Let e be a convergent variable. To depute e is to make a copy
e' of e, make the parents of e be parents of e', replace e with e' in the contributing factors of e, make
e' the only parent of e, and set the conditional probability $P(e \mid e')$ as follows:

$P(e \mid e') = \begin{cases} 1 & \text{if } e = e', \\ 0 & \text{otherwise.} \end{cases}$

We shall call e' the deputy of e. The deputy variable e' is a convergent variable by definition. The
variable e, which is convergent before deputation, becomes a regular variable after deputation. We
shall refer to it as a new regular variable. In contrast, we shall refer to the variables that are regular
before deputation as old regular variables. The conditional probability $P(e \mid e')$ is a homogeneous
factor by definition. It will sometimes be called the deputing function and written as I(e', e) since it
ensures that e' and e always take the same value.
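
The structural part of deputation can be sketched in a few lines. The code below is an illustration under assumptions: the BN is represented as a dictionary from each variable to its list of parents, deputies are named by appending an apostrophe, and the function names `depute` and `deputing_function` are hypothetical.

```python
def depute(parents, convergent):
    """Return (new_parents, new_convergent) for the deputation BN.

    parents: dict mapping each variable to the list of its parents.
    convergent: set of convergent variables in the original BN.
    """
    new_parents = {}
    for v, ps in parents.items():
        if v in convergent:
            deputy = v + "'"
            new_parents[deputy] = list(ps)  # the deputy inherits the parents of e
            new_parents[v] = [deputy]       # e' becomes the only parent of e
        else:
            new_parents[v] = list(ps)
    # after deputation, exactly the deputies are convergent
    return new_parents, {v + "'" for v in convergent}


def deputing_function(e_prime_value, e_value):
    """I(e', e): 1 when e' and e take the same value, 0 otherwise."""
    return 1.0 if e_prime_value == e_value else 0.0


# The BN of Figure 1, as described in the text
figure1 = {
    "a": [], "b": [], "c": [],
    "e1": ["a", "b", "c"],
    "e2": ["a", "b", "c"],
    "e3": ["e1", "e2"],
}
new_parents, new_convergent = depute(figure1, {"e1", "e2", "e3"})
```

Note that the deputy of e3 keeps the new regular variables e1 and e2 as parents, which matches the heterogeneous factors that mention the deputy of e3 together with e1 and e2.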

A deputation BN is obtained from a BN by deputing all the convergent variables. In a deputation
BN, deputy variables are convergent variables and only deputy variables are convergent variables.

Figure 2 shows the deputation of the BN in Figure 1. It factorizes the joint probability

$P(a, b, c, e_1, e_1', e_2, e_2', e_3, e_3')$

into homogeneous factors

$P(a), P(b), P(c), I_1(e_1', e_1), I_2(e_2', e_2), I_3(e_3', e_3)$,

and heterogeneous factors

$f_{11}(e_1', a), f_{12}(e_1', b), f_{13}(e_1', c), f_{21}(e_2', a), f_{22}(e_2', b), f_{23}(e_2', c), f_{31}(e_3', e_1), f_{32}(e_3', e_2)$.

This factorization has three important properties. 

1. Each heterogeneous factor contains one and only one convergent variable. (Recall that the $e_i$'s
are no longer convergent variables and their deputies are.)

2. Each convergent variable e' appears in one and only one homogeneous factor, namely the
deputing function I(e', e).

3. Except for the deputing functions, none of the homogeneous factors contain any convergent 
variables. 

Those properties are shared by the factorization represented by any deputation BN. 

Proposition 5 The heterogeneous factorization represented by a deputation BN is flexible. 

Proof: Consider the combination, by ⊗, of all the heterogeneous factors in the deputation BN. Since
the combination operator is commutative and associative, we can carry out the combination in the
following two steps. First, for each convergent (deputy) variable e', combine all the heterogeneous
factors that contain e', yielding the conditional probability $P(e' \mid \pi_{e'})$ of e'. Then combine those resulting
conditional probabilities. It follows from the first property mentioned above that for different
convergent variables $e_1'$ and $e_2'$, $P(e_1' \mid \pi_{e_1'})$ and $P(e_2' \mid \pi_{e_2'})$ do not share convergent variables. Hence the
combination of the $P(e' \mid \pi_{e'})$'s is just the multiplication of them. Consequently, the combination,
by ⊗, of all heterogeneous factors in a deputation BN is just the multiplication of the conditional
probabilities of all convergent variables. Therefore, we have

The joint probability of variables in a deputation BN
= multiplication of conditional probabilities of all variables
= multiplication of conditional probabilities of all regular variables
× multiplication of conditional probabilities of all convergent variables
= multiplication of all homogeneous factors
× combination (by ⊗) of all heterogeneous factors.

The proposition is hence proved. □ 

Deputation does not change the answer to a query. More precisely, we have 

Proposition 6 The posterior probability $P(X \mid Y=Y_0)$ is the same in a BN as in its deputation.

Proof: Let R, E, and E' be the lists of old regular, new regular, and deputy variables in the deputation
BN respectively. It suffices to show that P(R, E) is the same in the original BN as in the
deputation BN. For any new regular variable e, let e' be its deputy. It is easy to see that the quantity
$\sum_{e'} I(e', e) P(e' \mid \pi_{e'})$ in the deputation BN is the same as $P(e \mid \pi_e)$ in the original BN. Hence,

P(R, E) in the deputation BN
$= \sum_{E'} P(R, E, E')$
$= \sum_{E'} \prod_{r \in R} P(r \mid \pi_r) \prod_{e \in E} [P(e \mid e') P(e' \mid \pi_{e'})]$
$= \prod_{r \in R} P(r \mid \pi_r) \prod_{e \in E} [\sum_{e'} I(e', e) P(e' \mid \pi_{e'})]$
$= \prod_{r \in R} P(r \mid \pi_r) \prod_{e \in E} P(e \mid \pi_e)$
= P(R, E) in the original BN.

The proposition is proved. □

7. Tidy Heterogeneous Factorizations 

So far, we have only encountered heterogeneous factorizations that correspond to Bayesian networks.
In the following algorithm, the intermediate heterogeneous factorizations do not necessarily correspond
to BNs. They do have the property that they combine to form the appropriate marginal probabilities.
The general intuition is that the heterogeneous factors must combine with their sibling heterogeneous
factors before being multiplied by factors containing the original convergent variable.

In the previous section, we mentioned three properties of the heterogeneous factorization repre- 
sented by a deputation BN, and we used the first property to show that the factorization is flexible. 
The other two properties qualify the factorization as a tidy heterogeneous factorization, which is de- 
fined below. 

Let $z_1, z_2, \ldots, z_k$ be a list of variables in a deputation BN such that if a convergent (deputy)
variable e' is in $\{z_1, z_2, \ldots, z_k\}$, so is the corresponding new regular variable e. A flexible heterogeneous
factorization of $P(z_1, z_2, \ldots, z_k)$ is said to be tidy if

1. For each convergent (deputy) variable $e' \in \{z_1, z_2, \ldots, z_k\}$, the factorization contains the deputing
function I(e', e) and it is the only homogeneous factor that involves e'.

2. Except for the deputing functions, none of the homogeneous factors contain any convergent 
variables. 

As stated earlier, the heterogeneous factorization represented by a deputation BN is tidy. 

Under certain conditions, to be given in Theorem 3, one can obtain a tidy factorization of $P(z_2, \ldots, z_k)$
by summing out $z_1$ from a tidy factorization of $P(z_1, z_2, \ldots, z_k)$ using the following procedure.

Procedure sum-out1($\mathcal{F}_1$, $\mathcal{F}_2$, z)

• Inputs: $\mathcal{F}_1$ - A list of homogeneous factors,
$\mathcal{F}_2$ - A list of heterogeneous factors,
z - A variable.

• Output: A list of heterogeneous factors and a list of homogeneous factors.

1. Remove from $\mathcal{F}_1$ all the factors that contain z, multiply them resulting in, say, f.
If there are no such factors, set f=nil.

2. Remove from $\mathcal{F}_2$ all the factors that contain z, combine them by using ⊗ resulting
in, say, g. If there are no such factors, set g=nil.

3. If g=nil, add the new (homogeneous) factor $\sum_z f$ to $\mathcal{F}_1$.

4. Else add the new (heterogeneous) factor $\sum_z fg$ to $\mathcal{F}_2$.

5. Return ($\mathcal{F}_1$, $\mathcal{F}_2$).

Theorem 3 Suppose a list of homogeneous factors $\mathcal{F}_1$ and a list of heterogeneous factors $\mathcal{F}_2$ constitute
a tidy factorization of $P(z_1, z_2, \ldots, z_k)$. If $z_1$ is either a convergent variable, or an old regular
variable, or a new regular variable whose deputy is not in the list $\{z_2, \ldots, z_k\}$, then the procedure
sum-out1($\mathcal{F}_1$, $\mathcal{F}_2$, $z_1$) returns a tidy heterogeneous factorization of $P(z_2, \ldots, z_k)$.

The proof of this theorem is quite long and hence is given in the appendix.

8. Causal Independence and Inference

Our task is to compute $P(X \mid Y=Y_0)$ in a BN. According to Proposition 6, we can do this in the
deputation of the BN.

An elimination ordering consisting of the variables outside $X \cup Y$ is legitimate if each deputy
variable e' appears before the corresponding new regular variable e. Such an ordering can be found
using, with minor adaptations, minimum deficiency search or maximum cardinality search.
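
The legitimacy condition is easy to test for a given ordering. The following sketch assumes the ordering is a list of variable names and that a dictionary `deputy_of` (a hypothetical name) maps each new regular variable to its deputy; a pair is checked only when both variables occur in the ordering.

```python
def is_legitimate(ordering, deputy_of):
    """Check that every deputy e' precedes its new regular variable e.

    ordering: list of variables to be eliminated, in order.
    deputy_of: dict mapping each new regular variable e to its deputy e'.
    Assumption: pairs with a member outside the ordering (for instance a
    query or observed variable e) impose no constraint.
    """
    position = {v: i for i, v in enumerate(ordering)}
    for e, deputy in deputy_of.items():
        if e in position and deputy in position and position[deputy] > position[e]:
            return False
    return True


deputy_of = {"e1": "e1'", "e2": "e2'", "e3": "e3'"}
good = is_legitimate(["a", "b", "c", "e1'", "e2'", "e1", "e3'"], deputy_of)
bad = is_legitimate(["a", "e1", "b", "c", "e1'", "e2'", "e3'"], deputy_of)
```

The first ordering eliminates the deputy of e1 before e1 itself and is accepted; the second eliminates e1 first and is rejected.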

The following algorithm computes $P(X \mid Y=Y_0)$ in the deputation BN. It is called VE1 because
it is an extension of VE.

Procedure VE1($\mathcal{F}_1$, $\mathcal{F}_2$, X, Y, $Y_0$, p)

• Inputs: $\mathcal{F}_1$ - The list of homogeneous factors in the deputation BN;
$\mathcal{F}_2$ - The list of heterogeneous factors in the deputation BN;
X - A list of query variables;
Y - A list of observed variables;
$Y_0$ - The corresponding list of observed values;
p - A legitimate elimination ordering.

• Output: $P(X \mid Y=Y_0)$.

1. Set the observed variables in all factors to their observed values.

2. While p is not empty,

• Remove the first variable z from p.

• ($\mathcal{F}_1$, $\mathcal{F}_2$) = sum-out1($\mathcal{F}_1$, $\mathcal{F}_2$, z). Endwhile

3. Set h = multiplication of all factors in $\mathcal{F}_1$
× combination (by ⊗) of all factors in $\mathcal{F}_2$.
/* h is a function of variables in X. */

4. Return $h(X) / \sum_X h(X)$. /* renormalization */
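
The two procedures can be sketched together in executable form. This is a minimal illustration under several assumptions, not the implementation used in the experiments: all variables are binary, the base combination operator is `max` (which acts as "or" on 0 and 1), a factor is a pair of a variable tuple and a table, and step 1 (setting evidence) is omitted. The helper names and the probabilities in the usage example are hypothetical.

```python
from functools import reduce
from itertools import product

VALS = (0, 1)  # assumption: all variables are binary


def otimes(f, g, conv=frozenset(), op=max):
    """Combine factors f and g, each a pair (variables, table).

    For every convergent variable shared by f and g, sum over the pairs of
    values that op maps to the output value. With no shared convergent
    variable this reduces to ordinary multiplication.
    """
    (fv, ft), (gv, gt) = f, g
    shared = tuple(v for v in fv if v in gv and v in conv)
    rest = tuple(v for v in dict.fromkeys(fv + gv) if v not in shared)
    out = {k: 0.0 for k in product(VALS, repeat=len(shared) + len(rest))}
    for r in product(VALS, repeat=len(rest)):
        env = dict(zip(rest, r))
        for s1, s2 in product(product(VALS, repeat=len(shared)), repeat=2):
            e1 = {**env, **dict(zip(shared, s1))}
            e2 = {**env, **dict(zip(shared, s2))}
            key = tuple(op(x, y) for x, y in zip(s1, s2)) + r
            out[key] += ft[tuple(e1[v] for v in fv)] * gt[tuple(e2[v] for v in gv)]
    return shared + rest, out


def multiply(f, g):
    return otimes(f, g)  # empty set of convergent variables: pointwise product


def sum_out(f, z):
    fv, ft = f
    i = fv.index(z)
    out = {}
    for k, v in ft.items():
        reduced = k[:i] + k[i + 1:]
        out[reduced] = out.get(reduced, 0.0) + v
    return fv[:i] + fv[i + 1:], out


def sum_out1(F1, F2, z, conv, op=max):
    hom = [f for f in F1 if z in f[0]]
    het = [g for g in F2 if z in g[0]]
    F1 = [f for f in F1 if z not in f[0]]
    F2 = [g for g in F2 if z not in g[0]]
    f = reduce(multiply, hom) if hom else None
    g = reduce(lambda x, y: otimes(x, y, conv, op), het) if het else None
    if g is None:
        if f is not None:
            F1.append(sum_out(f, z))          # new homogeneous factor
    else:
        F2.append(sum_out(g if f is None else multiply(f, g), z))  # new heterogeneous factor
    return F1, F2


def ve1(F1, F2, order, conv, op=max):
    for z in order:
        F1, F2 = sum_out1(F1, F2, z, conv, op)
    het = reduce(lambda x, y: otimes(x, y, conv, op), F2) if F2 else None
    h = reduce(multiply, F1 + ([het] if het is not None else []))
    total = sum(h[1].values())
    return h[0], {k: v / total for k, v in h[1].items()}


# Usage: one convergent variable e with four causes, after deputation.
q = [0.1, 0.2, 0.3, 0.4]   # hypothetical P(c_i = 1)
p = [0.9, 0.8, 0.7, 0.6]   # hypothetical noisy-OR strengths
causes = ["c1", "c2", "c3", "c4"]
F1 = [((c,), {(0,): 1 - qi, (1,): qi}) for c, qi in zip(causes, q)]
F1.append((("e'", "e"), {(a, b): float(a == b) for a in VALS for b in VALS}))
F2 = [(("e'", c), {(0, 0): 1.0, (1, 0): 0.0, (0, 1): 1 - pi, (1, 1): pi})
      for c, pi in zip(causes, p)]
variables, p_e = ve1(F1, F2, causes + ["e'"], conv={"e'"})
```

For this noisy-OR model the result can be compared with the closed form: the probability that e is 0 is the product over the causes of 1 minus the prior times the strength. Treating multiplication as the special case of the combination with no shared convergent variable keeps the sketch short.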

Theorem 4 The output of VE1($\mathcal{F}_1$, $\mathcal{F}_2$, X, Y, $Y_0$, p) is indeed $P(X \mid Y=Y_0)$.

Proof: Consider the following modifications to the algorithm. First remove step 1. Then the factor
h produced at step 3 is a function of variables in X and Y. Add a new step after step 3 that sets
the observed variables in h to their observed values. We shall first show that the modifications do
not change the output of the algorithm and then show that the output of the modified algorithm is
$P(X \mid Y=Y_0)$.

Let f(y, -), g(y, -), and h(y, z, -) be three functions of y and other variables. It is evident that

$f(y, -)g(y, -)|_{y=\alpha} = f(y, -)|_{y=\alpha}\, g(y, -)|_{y=\alpha}$,

$[\sum_z h(y, z, -)]|_{y=\alpha} = \sum_z [h(y, z, -)|_{y=\alpha}]$.

If y is a regular variable, we also have

$f(y, -) \otimes g(y, -)|_{y=\alpha} = f(y, -)|_{y=\alpha} \otimes g(y, -)|_{y=\alpha}$.

Consequently, the modifications do not change the output of the procedure.

Since the elimination ordering p is legitimate, it is always the case that if a deputy variable e' has
not been summed out, neither has the corresponding new regular variable e. Let $z_1, \ldots, z_k$ be the
remaining variables in p at any time during the execution of the algorithm. Then, $e' \in \{z_1, \ldots, z_k\}$
implies $e \in \{z_1, \ldots, z_k\}$. This and the fact that the factorization represented by a deputation BN is tidy
enable us to repeatedly apply Theorem 3 and conclude that, after the modifications, the factor created
at step 3 is simply the marginal probability P(X, Y). Consequently, the output is $P(X \mid Y=Y_0)$. □

8.1 An Example 

This subsection illustrates VE1 by walking through an example. Consider computing $P(e_2 \mid e_3=0)$
in the deputation Bayesian network shown in Figure 2. Suppose the elimination ordering p is: a, b,
c, $e_1'$, $e_2'$, $e_1$, and $e_3'$. After the first step of VE1,

$\mathcal{F}_1 = \{P(a), P(b), P(c), I_1(e_1', e_1), I_2(e_2', e_2), I_3(e_3', e_3=0)\}$,
$\mathcal{F}_2 = \{f_{11}(e_1', a), f_{12}(e_1', b), f_{13}(e_1', c), f_{21}(e_2', a), f_{22}(e_2', b), f_{23}(e_2', c), f_{31}(e_3', e_1), f_{32}(e_3', e_2)\}$.

Now the procedure enters the while-loop and it sums out the variables in p one by one.
After summing out a,

$\mathcal{F}_1 = \{P(b), P(c), I_1(e_1', e_1), I_2(e_2', e_2), I_3(e_3', e_3=0)\}$,
$\mathcal{F}_2 = \{f_{12}(e_1', b), f_{13}(e_1', c), f_{22}(e_2', b), f_{23}(e_2', c), f_{31}(e_3', e_1), f_{32}(e_3', e_2), \psi_1(e_1', e_2')\}$,

where $\psi_1(e_1', e_2') = \sum_a P(a) f_{11}(e_1', a) f_{21}(e_2', a)$.

After summing out b,

$\mathcal{F}_1 = \{P(c), I_1(e_1', e_1), I_2(e_2', e_2), I_3(e_3', e_3=0)\}$,
$\mathcal{F}_2 = \{f_{13}(e_1', c), f_{23}(e_2', c), f_{31}(e_3', e_1), f_{32}(e_3', e_2), \psi_1(e_1', e_2'), \psi_2(e_1', e_2')\}$,

where $\psi_2(e_1', e_2') = \sum_b P(b) f_{12}(e_1', b) f_{22}(e_2', b)$.

After summing out c,

$\mathcal{F}_1 = \{I_1(e_1', e_1), I_2(e_2', e_2), I_3(e_3', e_3=0)\}$,
$\mathcal{F}_2 = \{f_{31}(e_3', e_1), f_{32}(e_3', e_2), \psi_1(e_1', e_2'), \psi_2(e_1', e_2'), \psi_3(e_1', e_2')\}$,

where $\psi_3(e_1', e_2') = \sum_c P(c) f_{13}(e_1', c) f_{23}(e_2', c)$.

After summing out $e_1'$,

$\mathcal{F}_1 = \{I_2(e_2', e_2), I_3(e_3', e_3=0)\}$,
$\mathcal{F}_2 = \{f_{31}(e_3', e_1), f_{32}(e_3', e_2), \psi_4(e_1, e_2')\}$,

where $\psi_4(e_1, e_2') = \sum_{e_1'} I_1(e_1', e_1) [\psi_1(e_1', e_2') \otimes \psi_2(e_1', e_2') \otimes \psi_3(e_1', e_2')]$.

After summing out $e_2'$,

$\mathcal{F}_1 = \{I_3(e_3', e_3=0)\}$,
$\mathcal{F}_2 = \{f_{31}(e_3', e_1), f_{32}(e_3', e_2), \psi_5(e_1, e_2)\}$,

where $\psi_5(e_1, e_2) = \sum_{e_2'} I_2(e_2', e_2) \psi_4(e_1, e_2')$.

After summing out $e_1$,

$\mathcal{F}_1 = \{I_3(e_3', e_3=0)\}$,
$\mathcal{F}_2 = \{f_{32}(e_3', e_2), \psi_6(e_3', e_2)\}$,

where $\psi_6(e_3', e_2) = \sum_{e_1} f_{31}(e_3', e_1) \psi_5(e_1, e_2)$.

Finally, after summing out $e_3'$,

$\mathcal{F}_1 = \emptyset$,
$\mathcal{F}_2 = \{\psi_7(e_2)\}$,

where $\psi_7(e_2) = \sum_{e_3'} I_3(e_3', e_3=0) [f_{32}(e_3', e_2) \otimes \psi_6(e_3', e_2)]$. Now the procedure enters step 3, where
there is nothing to do in this example. Finally, the procedure returns $\psi_7(e_2) / \sum_{e_2} \psi_7(e_2)$, which is
$P(e_2 \mid e_3=0)$, the required probability.

8.2 Comparing VE and VE1

In comparing VE and VE1, we notice that when summing out a variable, they both combine only
those factors that contain the variable. However, the factorization that the latter works with is of
finer grain than the factorization used by the former. In our running example, the latter works with a
factorization which initially consists of factors that contain only two variables, while the factorization
the former uses initially includes factors that contain five variables. On the other hand, the latter
uses the operator ⊗, which is more expensive than multiplication. Consider, for instance, calculating
$f(e, a) \otimes g(e, b)$. Suppose e is a convergent variable and all variables are binary. Then the operation
requires $2^4$ numerical multiplications and $2^4 - 2^3$ numerical summations. On the other hand,
multiplying f(e, a) and g(e, b) only requires $2^3$ numerical multiplications.
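
These counts can be reproduced by enumerating the terms of the combination. The sketch below assumes binary variables and "or" as the base combination operator; the function name `otimes_cost` is hypothetical.

```python
from itertools import product


def otimes_cost():
    """Count the arithmetic needed to combine f(e, a) and g(e, b).

    Assumptions: all variables are binary and the base combination
    operator of e is OR. Returns (multiplications, additions).
    """
    mults = 0
    cells = set()
    for a, b in product((0, 1), repeat=2):
        for alpha1, alpha2 in product((0, 1), repeat=2):
            mults += 1                          # f(alpha1, a) * g(alpha2, b)
            cells.add((alpha1 | alpha2, a, b))  # entry of the result that receives it
    adds = mults - len(cells)  # every product after the first in an entry costs one addition
    return mults, adds


combination_cost = otimes_cost()
plain_multiplications = 2 ** 3  # one product per entry of the table over (e, a, b)
```

The enumeration yields 16 multiplications and 8 additions for the combination, against 8 multiplications for the plain product.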

Figure 3: A BN, its deputation and transformations.

Despite the expensiveness of the operator ⊗, VE1 is more efficient than VE. We shall provide
empirical evidence in support of this claim in Section 10. To see a simple example where this is true,
consider the BN in Figure 3(1), where e is a convergent variable. Suppose all variables are binary.
Then, computing P(e) by VE using the elimination ordering $c_1$, $c_2$, $c_3$, and $c_4$ requires
$2^5 + 2^4 + 2^3 + 2^2 = 60$ numerical multiplications and $(2^5 - 2^4) + (2^4 - 2^3) + (2^3 - 2^2) + (2^2 - 2) = 30$
numerical additions. On the other hand, computing P(e) in the deputation BN shown in Figure 3(2)
by VE1 using the elimination ordering $c_1$, $c_2$, $c_3$, $c_4$, and e' requires only
$2^2 + 2^2 + 2^2 + 2^2 + (3 \times 2^2 + 2^2) = 32$ numerical multiplications and $2 + 2 + 2 + 2 + (3 \times 2 + 2) = 16$ numerical additions.
Note that summing out e' requires $3 \times 2^2 + 2^2$ numerical multiplications because after summing out
the $c_i$'s, there are four heterogeneous factors, each containing the only argument e'. Combining them
pairwise requires $3 \times 2^2$ multiplications. The resultant factor needs to be multiplied with the deputing
factor I(e', e), which requires $2^2$ numerical multiplications.
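
The totals above can be tallied step by step. The short sketch below only restates the arithmetic of this example for binary variables; the variable names are illustrative.

```python
# VE on Figure 3(1): eliminating c1..c4 builds product factors over 5, 4, 3, 2 variables.
ve_mults = sum(2 ** k for k in (5, 4, 3, 2))
ve_adds = sum(2 ** k - 2 ** (k - 1) for k in (5, 4, 3, 2))

# VE1 on Figure 3(2): each cause c_i needs P(c_i) * f_i(e', c_i), then a sum over c_i.
per_cause_mults = 2 ** 2
per_cause_adds = 2 ** 2 - 2

# Summing out e': three pairwise combinations of one-variable factors over e',
# one multiplication by the deputing function I(e', e), then the sum over e'.
otimes_mults = 2 ** 2
otimes_adds = 2 ** 2 - 2
deputing_mults = 2 ** 2
deputing_adds = 2 ** 2 - 2

ve1_mults = 4 * per_cause_mults + 3 * otimes_mults + deputing_mults
ve1_adds = 4 * per_cause_adds + 3 * otimes_adds + deputing_adds
```

Under these assumptions VE1 performs roughly half the arithmetic of VE on this network, even though each individual combination is more expensive than a multiplication.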

9. Previous Methods 

Two methods have been proposed previously for exploiting causal independence to speed up infer- 
ence in general BNs (Olesen et al., 1989; Heckerman, 1993). They both use causal independence to 
transform the topology of a BN. After the transformation, conventional algorithms such as CTP or 
VE are used for inference. 

We shall illustrate those methods by using the BN in Figure 3(1). Let * be the base combination
operator of e, let $\xi_i$ denote the contribution of $c_i$ to e, and let $f_i(e, c_i)$ be the contributing factor of $c_i$
to e.

The parent-divorcing method (Olesen et al., 1989) transforms the BN into the one in Figure 3(3).
After the transformation, all variables are regular and the new variables $e_1$ and $e_2$ have the same
possible values as e. The conditional probabilities of $e_1$ and $e_2$ are given by

$P(e_1 \mid c_1, c_2) = f_1(e, c_1) \otimes f_2(e, c_2)$,
$P(e_2 \mid c_3, c_4) = f_3(e, c_3) \otimes f_4(e, c_4)$.

The conditional probability of e is given by

$P(e=\alpha \mid e_1=\alpha_1, e_2=\alpha_2) = 1$ if $\alpha = \alpha_1 * \alpha_2$, and 0 otherwise,

for any value $\alpha$ of e, $\alpha_1$ of $e_1$, and $\alpha_2$ of $e_2$. We shall use PD to refer to the algorithm that first
performs the parent-divorcing transformation and then uses VE for inference.

The temporal transformation by Heckerman (1993) converts the BN into the one in Figure 3(4).
Again all variables are regular after the transformation and the newly introduced variables have the
same possible values as e. The conditional probability of $e_1$ is given by

$P(e_1=\alpha \mid c_1) = f_1(e=\alpha, c_1)$,

for each value $\alpha$ of $e_1$. For i=2, 3, 4, the conditional probability of $e_i$ ($e_4$ stands for e) is given by

$P(e_i=\alpha \mid e_{i-1}=\alpha_1, c_i) = \sum_{\alpha_1 * \alpha_2 = \alpha} f_i(e=\alpha_2, c_i)$,

for each possible value $\alpha$ of $e_i$ and $\alpha_1$ of $e_{i-1}$. We shall use TT to refer to the algorithm that first
performs the temporal transformation and then uses VE for inference.
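
The conditional probability tables introduced by the two transformations can be built mechanically from the base combination operator. The sketch below assumes binary variables with `max` as the operator, and it assumes that in the temporal transformation each new variable combines its predecessor with the contribution of the next cause; the function names and the strength 0.75 are hypothetical.

```python
from itertools import product

VALUES = (0, 1)  # assumption: binary variables


def deterministic_cpt(op=max):
    """Parent divorcing: P(e=a | e1=a1, e2=a2) is 1 if a == op(a1, a2), else 0.

    Keys are (a, a1, a2).
    """
    return {(a, a1, a2): float(a == op(a1, a2))
            for a, a1, a2 in product(VALUES, repeat=3)}


def temporal_cpt(f_i, op=max):
    """Temporal transformation: P(e_i=a | e_{i-1}=a1, c_i=c).

    Sums f_i(a2, c) over the values a2 with op(a1, a2) == a (assumed form).
    f_i is keyed by (value, c); the result is keyed by (a, a1, c).
    """
    return {(a, a1, c): sum((f_i[(a2, c)] for a2 in VALUES if op(a1, a2) == a), 0.0)
            for a, a1, c in product(VALUES, repeat=3)}


f = {(0, 0): 1.0, (1, 0): 0.0, (0, 1): 0.25, (1, 1): 0.75}  # hypothetical noisy-OR factor
pd_table = deterministic_cpt()
tt_table = temporal_cpt(f)
```

Both tables involve three variables, which is the source of the three-variable factors that appear when VE is run on the transformed networks.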

The factorization represented by the original BN includes a factor that contains five variables,
while factors in the transformed BNs contain no more than three variables. In general, the
transformations lead to finer-grain factorizations of joint probabilities. This is why PD and TT can be more
efficient than VE.

However, PD and TT are not as efficient as VE1. We shall provide empirical evidence in support
of this claim in the next section. Here we illustrate it by considering calculating P(e). Doing this in
Figure 3(3) by VE using the elimination ordering $c_1$, $c_2$, $c_3$, $c_4$, $e_1$, and $e_2$ would require
$2^3 + 2^2 + 2^3 + 2^2 + 2^3 + 2^2 = 36$ numerical multiplications and 18 numerical additions.⁵ Doing the same in
Figure 3(4) using the elimination ordering $c_1$, $e_1$, $c_2$, $e_2$, $c_3$, $e_3$, $c_4$ would require
$2^2 + 2^3 + 2^2 + 2^3 + 2^2 + 2^3 + 2^2 = 40$ numerical multiplications and 20 numerical additions. In both cases, more
numerical multiplications and additions are performed than by VE1. The differences are more drastic
in complex networks, as will be shown in the next section.
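
The PD and TT counts follow from the sizes of the product factors built at each elimination step. The sketch below assumes binary variables and that each step multiplies two factors and then sums out one variable; the function name `ve_cost` is hypothetical.

```python
def ve_cost(sizes):
    """Return (multiplications, additions) for a run of VE.

    sizes[i] is the number of binary variables in the product factor built
    when the i-th variable is eliminated. Assumption: each step multiplies
    two factors (one multiplication per entry) and then sums out one variable.
    """
    mults = sum(2 ** k for k in sizes)
    adds = sum(2 ** k - 2 ** (k - 1) for k in sizes)
    return mults, adds


pd_cost = ve_cost([3, 2, 3, 2, 3, 2])     # Figure 3(3): c1, c2, c3, c4, e1, e2
tt_cost = ve_cost([2, 3, 2, 3, 2, 3, 2])  # Figure 3(4): c1, e1, c2, e2, c3, e3, c4
```

Both totals exceed the 32 multiplications and 16 additions reported for VE1 on the same query.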

The saving for this example may seem marginal. It may be reasonable to conjecture that, as
Olesen's method produces families with three elements, this marginal saving is all that we can hope
for: producing factors of two elements rather than cliques of three elements. However, interacting
causal variables can make the difference more extreme. For example, if we were to use Olesen's
method for the BN of Figure 1, we would produce⁶ the network of Figure 4. Any triangulation for this
network has at least one clique with four or more elements, yet VE1 does not produce a factor with
more than two elements.

Note that as far as computing P(e) in the networks shown in Figure 3 is concerned, VE1 is more
efficient than PD, PD is more efficient than TT, and TT is more efficient than VE. Our experiments
show that this is true in general.

5. This is exactly the same number of operations required to determine P(e) using clique-tree propagation on the same
network. The clique tree for Figure 3(3) has three cliques, one containing $\{c_1, c_2, e_1\}$, one containing $\{c_3, c_4, e_2\}$, and
one containing $\{e_1, e_2, e\}$. The first clique contains 8 elements; to construct it requires $2^2 + 2^3 = 12$ multiplications.
The message that needs to be sent to the third clique is the marginal on $e_1$, which is obtained by summing out $c_1$ and
$c_2$. Similarly for the second clique. The third clique again has 8 elements and requires 12 multiplications to construct.
In order to extract P(e) from this clique, we need to sum out $e_1$ and $e_2$. This shows one reason why VE1 can be more
efficient than CTP or VE: VE1 never constructs a factor with three variables for this example. Note, however, that an
advantage of CTP is that the cost of building the cliques can be amortized over many queries.

6. Note that we need to produce two variables, both of which represent "noisy" a * b. We need two variables as the noise
applied in each case is independent. Note that if there was no noise in the network (if $e_1 = a * b * c$), we would only
need to create one variable, but also $e_1$ and $e_2$ would be the same variable (or at least be perfectly correlated). In this
case we would need a more complicated example to show our point.

Figure 4: The result of applying Olesen's method to the BN of Figure 1.

10. Experiments

The CPCS networks are multi-level, multi-valued BNs for medicine. They were created by Pradhan 
et al. (1994) based on the Computer-based Patient Case Simulation system (CPCS-PM) developed 
by Parker and Miller (1987). Two CPCS networks⁷ were used in our experiments. One of them
consists of 422 nodes and 867 arcs, and the other contains 364 nodes. They are among the largest 
BNs in use at the present time. 

The CPCS networks contain abundant causal independencies. As a matter of fact, each non-root 
variable is a convergent variable with base combination operator MAX. They are good test cases for 
inference algorithms that exploit causal independencies. 

10.1 CTP-based Approaches versus VE-based Approaches 

As we have seen in the previous section, one kind of approach for exploiting causal independencies
is to use them to transform BNs. Thereafter, any inference algorithm, including CTP or VE, can be
used for inference.

We found that the coupling of the network transformation techniques and CTP was not able to carry
out inference in the two CPCS networks used in our experiments. The computer ran out of memory
when constructing clique trees for the transformed networks. As will be reported in the next
subsection, however, the combination of the network transformation techniques and VE was able to answer
many queries.

This paper has proposed a new method of exploiting causal independencies. We have observed
that causal independencies lead to a factorization of a joint probability that is of finer grain than
the factorization entailed by conditional independencies alone. One can extend any inference
algorithm, including CTP and VE, to exploit this finer-grain factorization. This paper has extended
VE and obtained an algorithm called VE1. VE1 was able to answer almost all queries in the two
CPCS networks. We conjecture, however, that an extension of CTP would not be able to carry out
inference with the two CPCS networks at all. This is because the resources that VE1 takes to answer any
query in a BN can be no more than those an extension of CTP would take to construct a clique tree
for the BN and there are, as will be seen in the next subsection, queries in the two CPCS networks
that VE1 was not able to answer.

7. Obtained from ftp://camis.stanford.edu/pub/pradhan. The file names are CPCS-LM-SM-KO-VI . . txt and CPCS-networks/stdl .08.5.

Figure 5: Comparisons in the 364-node BN.

In summary, CTP-based approaches are not or would not be able to deal with the two CPCS
networks, while VE-based approaches can (to different extents).

10.2 Comparisons of VE-based Approaches 

This subsection provides experimental data to compare the VE-based approaches, namely PD, TT,
and VE1. We also compare those approaches with VE itself to determine how much can be gained
by exploiting causal independencies.

In the 364-node network, three types of queries with one query variable and five, ten, or fifteen
observations respectively were considered. Fifty queries were randomly generated for each query
type. A query is passed to the algorithms after nodes that are irrelevant to it have been pruned. In
general, more observations mean fewer irrelevant nodes and hence greater difficulty in answering the query.
The CPU times the algorithms spent in answering those queries were recorded.

In order to get statistics for all algorithms, CPU time consumption was limited to ten seconds 
and memory consumption was limited to ten megabytes. 

The statistics are shown in Figure 5. In the charts, the curve "5ve1", for instance, displays the
time statistics for VE1 on queries with five observations. Points on the X-axis represent CPU times
in seconds. For any time point, the corresponding point on the Y-axis represents the number of
five-observation queries that were each answered within the time by VE1.

Figure 6: Comparisons in the 422-node BN.

We see that while VE1 was able to answer all the queries, PD and TT were not able to answer
some of the ten-observation and fifteen-observation queries. VE was not able to answer a majority
of the queries.

To get a feeling about the average performances of the algorithms, regard the curves as representing
functions of y, instead of x. The integration, along the Y-axis, of the curve "10PD", for instance,
is roughly the total amount of time PD took to answer all the ten-observation queries that PD was
able to answer. Dividing this by the total number of queries answered, one gets the average time PD
took to answer a ten-observation query.
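
The relation between a curve and the average time can be illustrated with made-up data. In the sketch below the list `times` holds hypothetical CPU times of the queries an algorithm was able to answer; it is not taken from the experiments, and the function names are illustrative.

```python
def answered_within(times, t):
    """Height of the curve at time t: number of queries answered within t seconds."""
    return sum(1 for x in times if x <= t)


def average_time(times):
    """Integrating the curve along the Y-axis adds up one time per answered
    query, so dividing by the number of answered queries gives the mean."""
    return sum(times) / len(times)


times = [0.2, 0.5, 0.5, 1.5, 4.0]  # hypothetical CPU times in seconds
height_at_one_second = answered_within(times, 1.0)
mean_time = average_time(times)
```

A curve that rises steeply near the Y-axis therefore corresponds to a small average time per answered query.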

It is clear that on average, VE1 performed significantly better than PD and TT, which in turn
performed much better than VE. The average performance of PD on five- or ten-observation queries
is roughly the same as that of TT, and slightly better on fifteen-observation queries.

In the 422-node network, two types of queries with five or ten observations were considered and
fifty queries were generated for each type. The same space and time limits were imposed as in the
364-node network. Moreover, approximations had to be made; real numbers smaller than 0.00001
were regarded as zero. Since the approximations are the same for all algorithms, the comparisons
are fair.

The statistics are shown in Figure 6. The curves "5ve1" and "10ve1" are hardly visible because
they are very close to the Y-axis.

Again we see that on average, VE1 performed significantly better than PD, PD performed
significantly better than TT, and TT performed much better than VE.

One might notice that TT was able to answer thirty-nine ten-observation queries, more than
VE1 and PD were able to. This is due to the limit on memory consumption. As we will see in the
next subsection, with the memory consumption limit increased to twenty megabytes, VE1 was able
to answer forty-five ten-observation queries exactly in under ten seconds.

10.3 Effectiveness of VE1

We have now established that VE1 is the most efficient VE-based algorithm for exploiting causal
independencies. In this section we investigate how effective VE1 is.

Figure 7: Time statistics for VE1.

Experiments have been carried out in both of the two CPCS networks to answer this question. In
the 364-node network, four types of queries with one query variable and five, ten, fifteen, or twenty
observations respectively were considered. Fifty queries were randomly generated for each query
type. The statistics of the times VE1 took to answer those queries are given by the left chart in Figure
7. When collecting the statistics, a ten MB memory limit and a ten-second CPU time limit were
imposed to guard against excessive resource demands. We see that all fifty five-observation queries
in the network were each answered in less than half a second. Forty-eight ten-observation queries,
forty-five fifteen-observation queries, and forty twenty-observation queries were answered in one
second. There is, however, one twenty-observation query that VE1 was not able to answer within
the time and memory limits.

In the 422-node network, three types of queries with one query variable and five, ten, or fifteen
observations respectively were considered. Fifty queries were randomly generated for each query
type. Unlike in the previous section, no approximations were made. A twenty MB memory limit
and a forty-second CPU time limit were imposed. The time statistics are shown in the right-hand
chart. We see that VE1 was able to answer almost all queries and a majority of the queries were
answered in little time. There are, however, three fifteen-observation queries that VE1 was not able
to answer.



11. Conclusions 

This paper has been concerned with how to exploit causal independence in exact BN inference. Pre- 
vious approaches (Olesen et al., 1989; Heckerman, 1993) use causal independencies to transform 
BNs. Efficiency is gained because inference is easier in the transformed BNs than in the original 
BNs. 

A new method has been proposed in this paper. Here is the basic idea. A Bayesian network 
can be viewed as representing a factorization of a joint probability into the multiplication of a list of 
conditional probabilities. We have studied a notion of causal independence that enables one to further 
factorize the conditional probabilities into a combination of even smaller factors and consequently 
obtain a finer-grain factorization of the joint probability. 
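
To make the finer-grain factorization concrete, the following is a minimal sketch for the "or" operator with two binary parents. It is an illustration under stated assumptions, not the paper's implementation: the function names (`contribution`, `combine_or`, `noisy_or_cpt`) and the probabilities 0.8 and 0.6 are hypothetical, and the contribution of each parent is assumed to follow the usual noisy-OR form. The sketch checks that combining two 4-entry contribution factors reproduces the 8-entry conditional probability table.

```python
from itertools import product


def contribution(p):
    """Hypothetical contribution factor f(a, c) = P(contribution = a | parent = c).

    Assumes the noisy-OR form: an active parent (c = 1) produces the
    effect with probability p; an inactive parent never does.
    """
    return {(1, 1): p, (0, 1): 1.0 - p, (1, 0): 0.0, (0, 0): 1.0}


def combine_or(f, g):
    """Combine two contribution factors with the 'or' operator.

    h(a, c1, c2) = sum over all (a1, a2) with (a1 or a2) == a
                   of f(a1, c1) * g(a2, c2)
    """
    h = {}
    for a1, c1, a2, c2 in product((0, 1), repeat=4):
        key = (a1 | a2, c1, c2)
        h[key] = h.get(key, 0.0) + f[(a1, c1)] * g[(a2, c2)]
    return h


def noisy_or_cpt(p1, p2):
    """Full table P(e | c1, c2) of a two-parent noisy-OR, for comparison."""
    cpt = {}
    for c1, c2 in product((0, 1), repeat=2):
        p_off = (1.0 - p1) ** c1 * (1.0 - p2) ** c2
        cpt[(1, c1, c2)] = 1.0 - p_off
        cpt[(0, c1, c2)] = p_off
    return cpt


combined = combine_or(contribution(0.8), contribution(0.6))
direct = noisy_or_cpt(0.8, 0.6)
# The two small factors carry the same information as the full table.
assert all(abs(combined[k] - direct[k]) < 1e-12 for k in direct)
```

With n binary parents, this representation stores n factors of four entries each, whereas the full conditional probability table has 2^(n+1) entries.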

We propose to extend inference algorithms to make use of this finer-grain factorization. This 
paper has extended an algorithm called VE. Experiments have shown that the extended VE 
algorithm, VEj, is significantly more efficient than first performing Olesen et al.'s or Heckerman's 
transformation and then applying VE. 

The choice of VE instead of the more widely known CTP algorithm is due to its ability to work in 
networks that CTP cannot deal with. As a matter of fact, CTP was not able to deal with the networks 
used in our experiments, even after Olesen et al.'s or Heckerman's transformation. On the other hand, 
VEj was able to answer almost all randomly generated queries, and a majority of the queries 
were answered in little time. It would be interesting to extend CTP to make use of the finer-grain 
factorization mentioned above. 

As we have seen in the previous section, there are queries, especially in the 422-node network, 
that took VEj a long time to answer. There are also queries that VEj was not able to answer. For 
those queries, approximation is a must. We employed an approximation technique when comparing 
algorithms in the 422-node network. The technique captures, to some extent, the heuristic of ignoring 
minor distinctions. We are currently developing a way to bound the error of the technique and 
an anytime algorithm based on the technique. 

Appendix A. Proof of Theorem 3 

Theorem 3 Suppose a list of homogeneous factors $\mathcal{F}_1$ and a list of heterogeneous factors $\mathcal{F}_2$ 
constitute a tidy factorization of $P(z_1, z_2, \ldots, z_k)$. If $z_1$ is either a convergent variable, or an old regular 
variable, or a new regular variable whose deputy is not in the list $\{z_2, \ldots, z_k\}$, then the procedure 
sum-out1($\mathcal{F}_1$, $\mathcal{F}_2$, $z_1$) returns a tidy heterogeneous factorization of $P(z_2, \ldots, z_k)$. 

Proof: Suppose $f_1, \ldots, f_r$ are all the heterogeneous factors and $g_1, \ldots, g_s$ are all the homogeneous 
factors. Also suppose $f_1, \ldots, f_l$, $g_1, \ldots, g_m$ are all the factors that contain $z_1$. Then 



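\begin{align*}
\sum_{z_1} P(z_1, z_2, \ldots, z_k)
&= \sum_{z_1} \Bigl(\bigotimes_{j=1}^{r} f_j\Bigr) \prod_{i=1}^{s} g_i \\
&= \sum_{z_1} \Bigl[\Bigl(\bigotimes_{j=1}^{l} f_j\Bigr) \otimes \Bigl(\bigotimes_{j=l+1}^{r} f_j\Bigr)\Bigr] \prod_{i=1}^{m} g_i \prod_{i=m+1}^{s} g_i \\
&= \sum_{z_1} \Bigl[\Bigl(\bigotimes_{j=1}^{l} f_j \prod_{i=1}^{m} g_i\Bigr) \otimes \Bigl(\bigotimes_{j=l+1}^{r} f_j\Bigr)\Bigr] \prod_{i=m+1}^{s} g_i \tag{9} \\
&= \Bigl[\Bigl(\sum_{z_1} \bigotimes_{j=1}^{l} f_j \prod_{i=1}^{m} g_i\Bigr) \otimes \Bigl(\bigotimes_{j=l+1}^{r} f_j\Bigr)\Bigr] \prod_{i=m+1}^{s} g_i \tag{10}
\end{align*}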

where equation (10) is due to Proposition 3. Equation (9) follows from Proposition 4. As a matter 
of fact, if $z_1$ is a convergent variable, then it is the only convergent variable in $\prod_{i=1}^{m} g_i$ due to the 
first condition of tidiness. The condition of Proposition 4 is satisfied because $z_1$ does not appear 
in $f_{l+1}, \ldots, f_r$. On the other hand, if $z_1$ is an old regular variable or a new regular variable whose 
deputy does not appear in the list $z_2, \ldots, z_k$, then $\prod_{i=1}^{m} g_i$ contains no convergent variables due to the 
second condition of tidiness. Again the condition of Proposition 4 is satisfied. We have thus proved 
that sum-out1($\mathcal{F}_1$, $\mathcal{F}_2$, $z_1$) yields a flexible heterogeneous factorization of $P(z_2, \ldots, z_k)$. 

Let $e'$ be a convergent variable in the list $z_2, \ldots, z_k$. Then $z_1$ cannot be the corresponding new 
regular variable $e$. Hence the factor $I(e', e)$ is not touched by sum-out1($\mathcal{F}_1$, $\mathcal{F}_2$, $z_1$). Consequently, if 
we can show that the new factor created by sum-out1($\mathcal{F}_1$, $\mathcal{F}_2$, $z_1$) is either a heterogeneous factor 
or a homogeneous factor that contains no convergent variable, then the factorization returned is tidy. 

Suppose sum-out1($\mathcal{F}_1$, $\mathcal{F}_2$, $z_1$) does not create a new heterogeneous factor. Then no heterogeneous 
factors in $\mathcal{F}_2$ contain $z_1$. If $z_1$ is a convergent variable, say $e'$, then $I(e', e)$ is the only 
homogeneous factor that contains $e'$. The new factor is $\sum_{e'} I(e', e)$, which does not contain any convergent 
variables. If $z_1$ is an old regular variable or a new regular variable whose deputy is not in the list 
$z_2, \ldots, z_k$, then none of the factors that contain $z_1$ contain any convergent variables. Hence the new factor 
again does not contain any convergent variables. The theorem is thus proved. □ 
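
The case split used in the proof can be traced with a small scope-level sketch. It is only an illustration under assumptions: each factor is represented by its set of variable names alone (no numeric tables), the function name `sum_out1_scopes` is hypothetical, and the rule it encodes is the one the proof relies on, namely that the new factor is heterogeneous exactly when some heterogeneous factor contains the variable being summed out.

```python
def sum_out1_scopes(homogeneous, heterogeneous, z):
    """Scope-level sketch of summing out z from a factorization.

    Each factor is a frozenset of variable names. Returns the updated
    (homogeneous, heterogeneous) lists of scopes.
    """
    hom_with = [g for g in homogeneous if z in g]
    het_with = [f for f in heterogeneous if z in f]
    hom_rest = [g for g in homogeneous if z not in g]
    het_rest = [f for f in heterogeneous if z not in f]

    touched = hom_with + het_with
    if not touched:
        # z appears nowhere, so nothing changes.
        return hom_rest, het_rest

    # The new factor mentions every variable of the touched factors except z.
    new_scope = frozenset().union(*touched) - {z}
    if het_with:
        het_rest.append(new_scope)   # new heterogeneous factor
    else:
        hom_rest.append(new_scope)   # new homogeneous factor
    return hom_rest, het_rest


# Case from the proof: z1 is a convergent variable e' contained in no
# heterogeneous factor, so the new factor comes from I(e', e) alone.
hom, het = sum_out1_scopes(
    [frozenset({"e'", "e"}), frozenset({"e", "y"})], [], "e'"
)
assert het == []
assert hom == [frozenset({"e", "y"}), frozenset({"e"})]
```

In this example the new homogeneous factor has scope {e}, which contains no convergent variable, matching the last step of the proof.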